Decentralized Online Learning of Stackelberg Equilibrium in General-Sum Leader-Controller Markov Games
Currently Under Review
A brief game theory supplement
I want to go watch the Dodgers game, but Lynn really wants to go to Erewhon. However, if we both go to the Dodgers game, Lynn will be less happy than me, and similarly, if we both go to Erewhon, I’ll be less happy than Lynn. Yet we still want to go somewhere together; not going together is worse than going together anywhere.
Such a simple interaction between Lynn and me is a case of the “Battle of the Sexes”, a classic example from the field of game theory, which studies how individuals make decisions when their outcomes depend on the decisions of others. Games come in many different forms, yet each one has the following standard “ingredients”:
- The players. Who are the individuals in the game interacting with each other? For example, Lynn and I.
- The actions. What can each player do? For example, each of us can go to either the Dodgers game or Erewhon.
- The information. What does each player know? For example, do I know Lynn has already decided to go to Erewhon?
- The preferences. Which outcomes do the players prefer over others? For example, I prefer both of us going to the Dodgers game versus both of us going to Erewhon, but Lynn is the opposite. Each outcome can be valued through the payoff the player receives.
With these ingredients, we can express this game as a “payoff matrix”, where the rows correspond to one player’s actions (me) and the columns correspond to the other player’s actions (Lynn). The left value in each cell corresponds to the payoff I receive if we take those respective actions, and the right value corresponds to the payoff Lynn receives. So I receive more payoff than Lynn if we both go to Dodgers, while vice versa if we both go to Erewhon.

Players thus decide on strategies or policies, which dictate what actions they should play. These could be pure strategies (e.g. I simply choose “Dodgers”) or mixed strategies (e.g. I choose “Dodgers” 50% of the time and “Erewhon” the other 50%).
As with any other problem, we need a solution concept. In games, economists often look for equilibria, which, roughly speaking, are points of “stability”. The most famous is the Nash equilibrium, defined as a set of strategies where no player benefits from a unilateral deviation.
This is best seen through an example of a Nash equilibrium in the game above. Consider the pure strategy profile (Dodgers, Dodgers) and the unilateral deviations, where only one player switches their action. If I choose Erewhon, my payoff decreases from \(2\) to \(0\), so I would not benefit. Similarly, Lynn would not benefit by switching to Erewhon, as her payoff would decrease from \(1\) to \(0\). Thus, (Dodgers, Dodgers) constitutes a pure-strategy Nash equilibrium.
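To make the deviation check concrete, here is a minimal Python sketch that encodes the payoff matrix and enumerates the pure-strategy Nash equilibria by testing every unilateral deviation. The (Dodgers, Dodgers) and mismatch payoffs come from the numbers above; the (1, 2) payoffs for (Erewhon, Erewhon) are an assumed symmetric choice, since only their ordering is described in the text.

```python
from itertools import product

ACTIONS = ["Dodgers", "Erewhon"]
# Payoffs (my_payoff, lynns_payoff) indexed by (my_action, lynns_action).
# (2, 1) and the (0, 0) mismatches come from the deviation argument above;
# (1, 2) for (Erewhon, Erewhon) is an assumed symmetric choice.
PAYOFFS = {
    ("Dodgers", "Dodgers"): (2, 1),
    ("Dodgers", "Erewhon"): (0, 0),
    ("Erewhon", "Dodgers"): (0, 0),
    ("Erewhon", "Erewhon"): (1, 2),
}

def is_pure_nash(my_action, lynns_action):
    """A profile is a pure Nash equilibrium if no unilateral deviation helps."""
    my_payoff, lynns_payoff = PAYOFFS[(my_action, lynns_action)]
    # Would I gain by switching while Lynn stays put?
    if any(PAYOFFS[(a, lynns_action)][0] > my_payoff for a in ACTIONS):
        return False
    # Would Lynn gain by switching while I stay put?
    if any(PAYOFFS[(my_action, b)][1] > lynns_payoff for b in ACTIONS):
        return False
    return True

pure_nash = [p for p in product(ACTIONS, repeat=2) if is_pure_nash(*p)]
print(pure_nash)  # [('Dodgers', 'Dodgers'), ('Erewhon', 'Erewhon')]
```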
For a similar reason, (Erewhon, Erewhon) is also a Nash equilibrium. Surprisingly, there is in fact a third Nash equilibrium consisting of mixed strategies, where neither player’s expected payoff increases from a unilateral deviation. This is not really relevant for this paper, but it is still a very important concept within game theory! Try to find it as a challenge.
Stackelberg/sequential games
The Stackelberg model is a special kind of game-theoretic model that situates players within a “leader-follower” dynamic. For example, firm \(F\) could be choosing to enter a market by producing some nonzero amount of product. However, if firm \(L\) is already in the market, it has the power to choose how much to produce first. If firm \(L\) is strategic, it could produce a larger-than-normal quantity, lowering the market price of the product, which disincentivizes the entry of firm \(F\) and eliminates competition. See here for a worked-out theoretical example.
More generally, a sequential game is one where some player(s) (the “leader(s)”) move first, and the other player(s) (the “follower(s)”) observe the leader’s action before taking their own. When deciding strategies, the follower’s strategy is now conditional on that of the leader: a strategy for the follower thus specifies an action for each potential leader action. For example, if Lynn is the leader, one strategy for her would be to play “Erewhon”, while I as the follower could take the strategy {Erewhon | Erewhon, Dodgers | Dodgers}, where \(a | b\) means I choose action \(a\) if Lynn chooses action \(b\).
In the “Battle of the Sexes” example, if Lynn is the leader, she has the ability to choose “Erewhon” first, and so I, trying to maximize my payoff, choose “Erewhon” as well, since it is my best-response. If she chose “Dodgers” instead, my best-response would be “Dodgers”. Lynn, realizing this, uses backwards induction and determines that she should choose “Erewhon” to maximize her payoff.
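As a quick sanity check on this backwards-induction reasoning, here is a small sketch using the same assumed payoffs as in the earlier snippet: the follower best-responds to each leader action, and the leader picks the action whose induced best response maximizes her own payoff.

```python
# Backwards induction in the sequential game where Lynn leads.
# Same assumed payoffs as before: (my_action, lynns_action) -> (mine, Lynn's).
ACTIONS = ["Dodgers", "Erewhon"]
PAYOFFS = {
    ("Dodgers", "Dodgers"): (2, 1), ("Dodgers", "Erewhon"): (0, 0),
    ("Erewhon", "Dodgers"): (0, 0), ("Erewhon", "Erewhon"): (1, 2),
}

def my_best_response(lynns_action):
    """As the follower, I pick the action maximizing my own payoff."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, lynns_action)][0])

def lynns_leader_choice():
    """Lynn anticipates my best response and maximizes her resulting payoff."""
    return max(ACTIONS, key=lambda b: PAYOFFS[(my_best_response(b), b)][1])

leader_action = lynns_leader_choice()              # 'Erewhon'
follower_action = my_best_response(leader_action)  # 'Erewhon'
print(leader_action, follower_action)
```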
In this sense, (Erewhon, {Erewhon | Erewhon, Dodgers | Dodgers}) constitutes a Nash equilibrium for this sequential game. Similarly, (Erewhon, {Erewhon | Erewhon, Erewhon | Dodgers}) would also be an equilibrium: Lynn would not want to switch to Dodgers, and I would not want to change my response to Lynn playing Erewhon, while changing my (off-path) response to Dodgers would not affect my payoff at all.
Interestingly, (Dodgers, {Dodgers | Erewhon, Dodgers | Dodgers}) is also an equilibrium! (Can either of us benefit from a unilateral deviation of our strategy?) But Lynn should have the intuitive advantage of choosing Erewhon first. What’s happening here is that I am essentially “threatening” Lynn into playing Dodgers by saying I will still choose Dodgers even if she chooses Erewhon, even though that leaves me worse off too. Some game theorists don’t find this solution intuitive (of course, Lynn would see through my bluff immediately), so there is an alternate equilibrium concept in sequential games, known as a subgame-perfect Nash equilibrium, that eliminates these so-called “non-credible threats”.
That’s a brief overview of the game theory background you need for this paper (which, in essence, is still a reinforcement learning paper). I’d like to end this section with some remarks:
- One key note here is that in these settings, players had full information about their own and other players’ payoffs. When we adapt this setting into a learning problem (i.e. the crux of this paper), we assume these payoffs are unknown to the players, who have to learn them through sampling. This is one key difference between how economists and learning researchers study games.
- While we can study the existence and uniqueness of equilibria, game theory doesn’t really tell us which equilibrium the game will settle to, nor does it tell us how these systems evolve to reach these equilibria. Again, equilibria are just points of “stability”. In the simultaneous game, (Dodgers, Dodgers) is a better equilibrium for me, but (Erewhon, Erewhon) is better for Lynn. Which one will the game settle to?
- However, in algorithmic game theory, where this economic intuition is less emphasized and we simply try to learn and converge to equilibria (which are often unique in the settings studied), this learning process is clearer.
Learning Equilibrium in Stackelberg Markov Games
At last we reach the reinforcement learning theory. In algorithmic game theory, our goal is to develop algorithms for some game that dictate what policies the players take, such that over time, they minimize players’ regret and/or learn and converge to some equilibrium within some degree of approximation. For this paper, we adapt the Stackelberg game model into a two-player Markov decision process (MDP), where the rewards and transition dynamics are dictated by the joint actions of both players. For a background on MDPs, I recommend checking out this blog, where I briefly introduce the finite, tabular, single-agent MDP. The key distinctions here are that in any given state, the leader first takes an action, then the follower sees this action and responds with an action of their own. Both the leader and follower have their own respective payoff/loss functions, which depend on both players’ actions. Neither player knows their own loss function; they can only observe the actions being played and the noisy losses they receive.
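To fix ideas, here is a schematic sketch of a single episode of this interaction. The environment interface (`reset`, `noisy_losses`, `step`) and the function names are placeholders of my own, not the paper’s notation; the point is only the order of play and what each player gets to observe.

```python
def run_episode(env, leader_policy, follower_policy, H):
    """Schematic single episode of the two-player Stackelberg Markov game.

    `env` is a hypothetical simulator exposing reset(), noisy_losses(s, a, k),
    and step(s, a, k); leader_policy(s) and follower_policy(s, a) return
    sampled actions. Neither player sees the underlying loss functions,
    only the realized noisy losses collected below.
    """
    s = env.reset()
    trajectory = []
    for h in range(H):
        a = leader_policy(s)           # the leader moves first
        k = follower_policy(s, a)      # the follower observes a, then responds
        l_leader, l_follower = env.noisy_losses(s, a, k)
        trajectory.append((s, a, k, l_leader, l_follower))
        s = env.step(s, a, k)          # next state drawn from the joint action
    return trajectory
```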
Leader-controller Markov games and game reward structures
For this Markov game in particular, we impose a leader-controller assumption, where the transition dynamics only depend on the action the leader takes. Allow me to discuss why such an assumption is important.
As we’ve seen before, in a game, the rewards of one player depend on the potentially arbitrary actions of the other player. While the rewards for some joint action can be constant or come from some fixed distribution, this dependence on the other player’s actions inherently makes the reward observations somewhat arbitrary from the reference frame of a single player. Thus, games ride this weird “middle-ground” between stochastic and adversarial rewards. It should then be no surprise that in standard bandit games, algorithms designed to converge to Nash equilibria use the exponential weights algorithm for each player, an algorithm designed for regret minimization in adversarial environments.
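For reference, the exponential weights (Hedge) update is only a few lines. This is just the adversarial baseline being referenced, not our paper’s algorithm, and the learning rate below is an arbitrary placeholder.

```python
import numpy as np

def exp_weights_update(weights, losses, eta):
    """One exponential weights step: exponentially down-weight high-loss actions."""
    new_w = weights * np.exp(-eta * losses)
    return new_w / new_w.sum()

# Example: two actions, uniform start, observed losses [1.0, 0.0].
w = np.ones(2) / 2
w = exp_weights_update(w, np.array([1.0, 0.0]), eta=0.1)
```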
When we move to Markov games however, both players now can affect the transition dynamics, making the transitions potentially arbitrary from the reference frame of one player. In this same blog, we deal with the corrupted setting where the transitions are potentially arbitrary. Importantly, in these settings, if the level of corruption in the transitions is high, players can be doomed to experience linear regret.
Thus, by suppressing the transitions’ dependence on the follower, we make both regret minimization and equilibrium learning for the players feasible in Markov games.
What’s Stackelberg equilibrium in Markov games?
Recall that a Stackelberg equilibrium consists of both players choosing policies that best-respond to the other player’s policy.
For the follower, since their actions no longer influence the transitions, their environment is dictated by what state they are in and what action the leader takes. In this sense, their learning problem reduces to parallel stochastic bandits, one for each state-leader action pair they can encounter. We make the standard assumption that the follower’s best-response to any leader action is unique, and thus define the follower’s optimal, or best-response, policy \(\pi^f_\star\) to be the policy for which, for each state-leader action pair \((s,a)\),
\[\begin{equation} \pi^f_\star(s,a) := \underset{\pi^f}{\arg\min} \ \mathbb{E}_{k \sim \pi^f(\cdot|s,a)}\left[l^f(s,a, k)\right]. \end{equation}\]
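Concretely, the follower can keep independent bandit statistics for every (state, leader-action) pair. Below is a minimal UCB-style sketch written for losses, so optimism means subtracting the confidence bonus; the \(\sqrt{2\log t / n}\) bonus is a standard textbook choice, not necessarily the exact one used in our paper.

```python
import math
from collections import defaultdict

class FollowerBandits:
    """One independent UCB-style bandit per (state, leader_action) pair, for losses."""

    def __init__(self, follower_actions):
        self.actions = follower_actions
        self.counts = defaultdict(lambda: defaultdict(int))    # (s, a) -> k -> pulls
        self.means = defaultdict(lambda: defaultdict(float))   # (s, a) -> k -> avg loss

    def act(self, s, a, t):
        """Pick a follower action for (state s, leader action a) at episode t >= 1."""
        n, mu = self.counts[(s, a)], self.means[(s, a)]
        for k in self.actions:          # play each action once before using bonuses
            if n[k] == 0:
                return k
        # Optimistic (lower-confidence) loss estimate: mean minus a bonus.
        return min(self.actions, key=lambda k: mu[k] - math.sqrt(2 * math.log(t) / n[k]))

    def update(self, s, a, k, loss):
        """Incorporate the observed noisy loss for follower action k."""
        self.counts[(s, a)][k] += 1
        n = self.counts[(s, a)][k]
        self.means[(s, a)][k] += (loss - self.means[(s, a)][k]) / n
```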
For the leader, their learning problem is now akin to an adversarial MDP. Given a leader loss function \(\ell\), leader policy \(\pi\), and follower policy \(\pi^f\), we can define the leader’s expected sum of losses for the remainder of an episode starting from state \(s_{h'}\) in some layer \(h'\), \(h'=0, 1, \ldots, H-1\), as: \[\begin{equation} V^{\pi, \pi^f}(s_{h'};\ell) := \mathbb{E}_{\substack{a_h \sim \pi(\cdot|s_{h}), \\ k_h \sim \pi^f(\cdot|a_h, s_h), \\ s_{h+1} \sim P(\cdot |s_{h}, a_h)}} \left[\sum_{h=h'}^{H-1} \ell(s_{h},a_h, k_h) \right] \end{equation}\] where \(V\) is called the leader’s state value function or \(V\)-value function, given the leader follows policy \(\pi\) and the follower follows policy \(\pi^f\).
Thus, we say \((\pi_\star, \pi^f_\star)\) constitutes a Stackelberg equilibrium (SE) if the leader policy \(\pi_\star\) is in \(\arg\min_\pi V^{\pi, \pi^f_\star}(s)\) for all states \(s\). In particular, for the \(T\)-episodic Stackelberg Markov game, we say that \(\pi_T\) is an \(\varepsilon\)-approximate Stackelberg equilibrium policy if for all \(s \in \mathcal{S}\), the value functions at state \(s\) under \((\pi_T, \pi^f_\star)\) and some SE \((\pi_{\star}, \pi^f_\star)\) are at most \(\varepsilon\) apart, i.e. \(V^{\pi_T, \pi^f_\star}(s) \leq V^{\pi_\star, \pi^f_\star}(s) + \varepsilon\).
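If the losses and transitions were known, this value function could be evaluated with a standard backward recursion over the layers. Here is a tabular sketch; the array shapes and names are my own conventions, not the paper’s.

```python
import numpy as np

def leader_value(pi, pif, P, loss, H):
    """Evaluate V^{pi, pi^f}(.; loss) by backward recursion over layers.

    Hypothetical tabular inputs:
      pi[h, s, a]      leader policy,       pif[h, s, a, k]  follower policy,
      P[h, s, a, s2]   transition kernel,   loss[h, s, a, k] leader loss.
    Returns V with V[h, s] = expected remaining loss from state s at layer h.
    """
    S = P.shape[1]
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        # Expected immediate loss under the joint play at layer h.
        immediate = np.einsum("sa,sak,sak->s", pi[h], pif[h], loss[h])
        # Expected continuation value; transitions are leader-controlled,
        # so they depend only on (s, a).
        continuation = np.einsum("sa,sax,x->s", pi[h], P[h], V[h + 1])
        V[h] = immediate + continuation
    return V
```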
Our goal is to develop algorithms, in particular, decentralized algorithms, that provably allow the players to converge to a Stackelberg equilibrium while minimizing individual player regret.
Our contribution
We frame learning Stackelberg equilibrium as a reinforcement learning problem. Namely, we first define MDP Stackelberg regret as:
\[\begin{equation}\label{eq: stackelberg_regret} R_{\text{Stack}}(T) := \sum_{t=1}^T V^{\pi_t, \pi^f_\star}(s_0; \ell) - \min_{\pi} \sum_{t=1}^T V^{\pi, \pi^f_\star}(s_0; \ell). \end{equation}\]
The second term of \(R_{\text{Stack}}(T)\) is the total leader loss in SE, and so algorithms whose MDP Stackelberg regret scales sub-linearly with \(T\) converge to an SE. Note that in the first term of \(R_{\text{Stack}}(T)\), the follower’s best-response policy is fixed as well, which differentiates Stackelberg regret analysis from that of a traditional single-agent adversarial MDP, where the first term would instead be the expected incurred losses of the leader during play (with the follower playing \(\{\pi^f_t\}_{1\leq t \leq T}\)). Stackelberg regret thus intuitively incorporates the “learning capacity” of the regret-minimizing follower; indeed, if \(\pi^f_t \neq \pi^f_\star\) for many \(t\), this may prevent the leader from adequately learning \(\pi_\star\) themselves, leading to increased Stackelberg regret.
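As a toy illustration of the definition, given the per-episode values \(V^{\pi_t, \pi^f_\star}(s_0;\ell)\) and the benchmark \(\min_\pi V^{\pi, \pi^f_\star}(s_0;\ell)\) (both hypothetical precomputed inputs here), Stackelberg regret is just a difference of sums:

```python
def stackelberg_regret(values_vs_best_response, best_fixed_value):
    """MDP Stackelberg regret from precomputed quantities.

    values_vs_best_response[t] plays the role of V^{pi_t, pi^f_*}(s_0; loss),
    i.e. the leader's value had the follower best-responded in episode t, and
    best_fixed_value plays the role of min_pi V^{pi, pi^f_*}(s_0; loss).
    Standard adversarial-MDP regret would instead use the leader's values
    against the follower's actual policies pi^f_t in the first term.
    """
    T = len(values_vs_best_response)
    return sum(values_vs_best_response) - T * best_fixed_value
```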
Our algorithms for achieving sublinear Stackelberg regret are rather intuitive. First, the leader follows a standard adversarial MDP algorithm. This can be efficiently implemented through the policy optimization scheme in Luo et al. 2021 or the sample-efficient UOB-REPS algorithm (Jin et al. 2020). Meanwhile, the follower runs parallel UCB algorithms for each state-leader action environment to learn the best-response policy over time. Both of these algorithms ensure sublinear regret for each of the individual players.
The last key detail in adapting our algorithm for Stackelberg regret is adding an exploration term to the leader’s policy: with some probability \(\alpha\), the leader picks an action uniformly at random. The point of this is to ensure each state-leader action environment is reached often enough that the follower has sufficient samples within that environment to confidently learn the best-response within some small error. While this exploration incurs extra regret, choosing the exploration probability to be on the order of \(\mathcal{O}(T^{-1/3})\) ensures enough exploration while still achieving sublinear Stackelberg regret.
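Putting the pieces together, the overall interaction looks roughly like the schematic loop below. This is not Algorithm 1 verbatim: the leader’s adversarial-MDP machinery is abstracted behind a hypothetical `leader_algo` object, `follower` could be the per-(state, leader-action) UCB sketch from earlier, and the environment interface is the same placeholder as before.

```python
import numpy as np

def mix_with_uniform(pi_s, alpha):
    """Executed leader policy: with probability alpha, explore uniformly."""
    return (1 - alpha) * pi_s + alpha / len(pi_s)

def run(T, H, env, leader_algo, follower, alpha, rng):
    """Schematic decentralized loop (not the paper's Algorithm 1 verbatim).

    leader_algo is a hypothetical adversarial-MDP learner exposing policy(s)
    (a distribution over leader actions as a numpy array) and update(trajectory).
    """
    for t in range(1, T + 1):
        s = env.reset()
        trajectory = []
        for h in range(H):
            pi_tilde = mix_with_uniform(leader_algo.policy(s), alpha)
            a = rng.choice(len(pi_tilde), p=pi_tilde)   # leader acts first
            k = follower.act(s, a, t)                   # follower responds from its own data
            l_leader, l_follower = env.noisy_losses(s, a, k)
            follower.update(s, a, k, l_follower)
            trajectory.append((s, a, l_leader))
            s = env.step(s, a, k)                       # leader-controlled: next state ignores k
        leader_algo.update(trajectory)                  # e.g. a policy-optimization step

# Usage sketch: run(T, H, env, leader_algo, FollowerBandits(actions), alpha, np.random.default_rng())
```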
Our main result, for Algorithm 1 in our paper, is the following: in the \(T\)-episodic Stackelberg Markov game setting, if the leader and follower apply Algorithm 1 and set the exploration parameter \(\alpha = \mathcal{O}(H^{-1}SA^{\frac{2}{3}}(\log A)^{\frac{1}{3}} T^{-\frac{1}{3}})\), then with probability at least \(1-\mathcal{O}(\delta)\), the MDP Stackelberg regret is bounded as: \[ R_{\text{Stack}}(T) \leq \widetilde{\mathcal{O}}\left(\frac{H^2SA^{\frac{4}{3}}KT^{\frac{1}{3}}}{\beta\varepsilon^2} + H^4 + H^2SA^{\frac{2}{3}}T^{\frac{2}{3}} \right) \] where \(H\) is the episode length, \(S\) is the size of the state space, \(A\) is the size of the leader action set, \(K\) is the size of the follower action set, \(\beta\) and \(\varepsilon\) are setting-dependent gap constants, and \(\widetilde{\mathcal{O}}\) hides logarithmic dependencies.
Notably, we develop algorithms that establish the first sublinear Stackelberg regret guarantee for the decentralized Stackelberg Markov game setting, on the order of \(\mathcal{O}(T^{2/3})\) with high probability. As a corollary, we show how the players can learn an \(\varepsilon\)-approximate Stackelberg equilibrium in the \(T\)-episodic SLSF Stackelberg MDP setting if both employ Algorithm 1, with a longer horizon \(T\) resulting in a better approximation. To see this, define the time-averaged policy or average-iterate policy \(\bar{\pi}\) given by: \[\begin{equation} \bar{\pi}(\cdot|s) = \frac{1}{T}\sum_{t=1}^T\widetilde{\pi}_t(\cdot|s) \end{equation}\] for all states \(s\). After \(T\) episodes, the leader can implement \(\bar{\pi}\) (with the follower best-responding) by sampling \(t\) uniformly from \(\{1, \dots, T\}\) at each state and then sampling \(a \sim \widetilde{\pi}_{t}(\cdot | s)\). We thus have the following result:
If the leader and follower employ Algorithm 1 and set \(\alpha = \mathcal{O}(H^{-1}SA^{\frac{2}{3}}(\log A)^{\frac{1}{3}} T^{-\frac{1}{3}})\), then with probability at least \(1-\mathcal{O}(\delta)\), the average-iterate policy \(\bar{\pi}\) is an \(\varepsilon_T\)-approximate Stackelberg equilibrium policy, where \(\varepsilon_T = \widetilde{\mathcal{O}}\left(\frac{H^3SA^{\frac{4}{3}}KT^{-\frac{2}{3}}}{\beta\varepsilon^2} + \frac{H^5}{T} + H^3SA^{\frac{2}{3}}T^{-\frac{1}{3}} \right)\).
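As an implementation note, the sampling scheme for \(\bar{\pi}\) described above takes only a couple of lines; `policies` below is a hypothetical store of the executed policies, with `policies[t][s]` giving the distribution \(\widetilde{\pi}_t(\cdot|s)\) as an array.

```python
import numpy as np

def sample_from_average_policy(policies, s, rng):
    """Implement the average-iterate policy bar-pi at state s:
    draw an episode index t uniformly, then sample a ~ tilde-pi_t(. | s)."""
    t = rng.integers(len(policies))   # uniform over the T stored episodes
    pi_t_s = policies[t][s]
    return rng.choice(len(pi_t_s), p=pi_t_s)

# Usage: a = sample_from_average_policy(policies, s, np.random.default_rng())
```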
Proofs
I’ll present a high-level overview of the proof of our main sublinear Stackelberg regret result, also detailed in our paper. We first decompose \(R_\text{Stack}(T)\) into the terms: \[\begin{align*} R_{\text{Stack}}(T) &=\sum_{t=1}^T \left(V^{\pi_t, \pi^f_*}\left(s_0\right) - V^{\pi_*, \pi^f_*}(s_0)\right) \\ & =\underbrace{\sum_{t=1}^T \left(V^{\pi_t, \pi^f_*}\left(s_0\right)-V^{\pi_t, \pi^f_t}\left(s_0\right)\right)}_{\text{(Term I)}}+\underbrace{\sum_{t=1}^T \left(V^{\pi_t, \pi^f_t}\left(s_0\right)-V^{\pi_*, \pi^f_t}\left(s_0\right)\right)}_{\text{(Term II)}}+\underbrace{\sum_{t=1}^T \left(V^{\pi_*, \pi^f_t}\left(s_0\right)- V^{\pi_*, \pi^f_*}\left(s_0\right)\right)}_{\text{(Term III)}} \end{align*}\] Terms I and III capture the learning of the follower. We first bound the empirical values of these terms by the number of times the follower does not best-respond to the state-leader actions visited, which is small by 1) standard UCB analysis on the number of pulls needed to learn a best-response given any state-leader action environment, and 2) the explicit \(\alpha\) exploration parameter, set to \(\mathcal{O}(T^{-\frac{1}{3}})\), which allows each state-leader action pair to be visited enough times for the follower to adequately learn the best-response. Combining this with Azuma’s concentration inequality to bound the expected values, we get that with probability \(1-\mathcal{O}(\delta)\): \[\begin{align*} &\text{Term I}\leq \mathcal{O}\left(SAK\left(\frac{\log T}{\varepsilon^2}+\log\left(\frac{SAK}{\delta}\right)\right)+H\sqrt{T\log\left(\frac{1}{\delta}\right)}\right) \\ &\text{Term III} \leq \mathcal{O}\Biggl(H^2A^{\frac{4}{3}}KT^{\frac{1}{3}}\frac{\log(1/\delta)}{\beta} \cdot \left(\frac{\log T}{\varepsilon^2}+\log\left(\frac{SAK}{\delta}\right)\right) + H\sqrt{T\log\left(\frac{1}{\delta}\right)}\Biggr). \end{align*}\] Term II draws back to the motivation for using adversarial MDP algorithms for the leader. We first suppress the value functions’ dependence on the follower’s policies \((\pi^f_t)\) by treating the leader’s losses as adversarial, thus bounding Term II by the regret of a single-agent adversarial MDP. From here, the challenge lies in bounding the additional regret incurred by the \(\alpha\) exploration taken by \(\widetilde{\pi}_t\), rather than using the unmodified \(\pi_t\). We use the well-known performance difference lemma to bound Term II by: \[ \text{Term II} \leq \sum_{s\in\mathcal{S}} q^{\pi_\star}(s) \sum_{t=1}^T \sum_{a \in \mathcal{A}} \left(\widetilde{\pi}_t(a| s)-\pi_\star(a| s)\right) Q_t^{\widetilde{\pi}_t}(s,a) \] where \(q^\pi(s)\) is the probability of visiting state \(s\) following policy \(\pi\) and \(Q_t^\pi\) is the (hypothetical) adversarial \(Q\)-value function that has suppressed the follower’s policies.
Using the definition of \(\widetilde{\pi}\), we can further bound: \[\begin{align*} &\text{Term II}\leq \underbrace{\sum_{s\in\mathcal{S}} q^{\pi_\star}(s) \sum_{t=1}^T \sum_{a \in \mathcal{A}} \left(\pi_t(a| s)-\pi_\star(a| s)\right) Q_t^{\widetilde{\pi}_t}(s,a)}_{(\star)} + \mathcal{O}(\alpha TH) \end{align*}\] Next, we capture the remaining exploration regret within the EXPLORE terms in the following decomposition of \((\star)\): \[\begin{align*} (\star) &= \underbrace{\sum_{s\in\mathcal{S}} q^{\pi_\star}(s) \sum_{t=1}^T \langle\pi_t(\cdot| s) - \pi_\star(\cdot|s), Q_t^{\pi_t}(s,\cdot)\rangle}_{\text{REG-TERM}} \\ &+ \underbrace{\sum_{s\in\mathcal{S}} q^{\pi_\star}(s) \sum_{t=1}^T \langle\pi_t(\cdot| s), Q_t^{\widetilde{\pi}_t}(s,\cdot)-Q_t^{\pi_t}(s,\cdot)\rangle}_{\text{EXPLORE}_1} \\ &+ \underbrace{\sum_{s\in\mathcal{S}} q^{\pi_\star}(s) \sum_{t=1}^T \langle \pi_\star(\cdot|s), Q_t^{\pi_t}(s,\cdot) - Q_t^{\widetilde{\pi}_t}(s,\cdot) \rangle}_{\text{EXPLORE}_2} \end{align*}\] REG-TERM can at last be bounded using the \(\widetilde{\mathcal{O}}\left(H^4 + H^2S\sqrt{AT}\right)\) regret guarantee of Luo et al. 2021, and we use a backwards induction technique to show that both EXPLORE\(_1\) and EXPLORE\(_2\) can simultaneously be bounded by \(\alpha\cdot\mathcal{O}(TH^3)\). Chaining everything together, we show that by choosing \(\alpha = \mathcal{O}(H^{-1}SA^{\frac{2}{3}}(\log A)^{\frac{1}{3}} T^{-\frac{1}{3}})\), we have that: \[ \text{Term II} \leq \widetilde{\mathcal{O}}\left(H^4 + H^2SA^{\frac{2}{3}}T^{\frac{2}{3}}\right). \] Chaining Terms I, II, and III gives our theorem.
Final remarks
This was my first project at the intersection of reinforcement learning and game theory. We spent many months revising our problem formulation, as there always seemed to be some difficulty with integrating the Stackelberg setting into a Markov game, whether it was the arbitrary transitions, reward structures, or dealing with the exploration parameter.
However, the result turned out nicely. I definitely see a lot of room for improvement in the assumptions and regret bounds, yet for a first project in algorithmic game theory, I learned a lot and was able to see how far I’ve come in reinforcement learning theory since one year ago. I’d like to thank William Chang for his mentorship over this past year, as well as my colleagues Ricardo Parada, Helen Yuan, and Larissa Xu.
As always, if you are a new member of BruinML starting on any projects related to RL or game theory, I’d love to explain or hear your ideas over email (kennyguo@ucla.edu) or Discord.